Method for accelerating the execution of speech recognition neural networks and the related speech recognition device

ABSTRACT

A method for accelerating execution of a neural network (4) in a speech recognition system, specifically for recognition of words contained in one or more subsets of a general vocabulary, involves the following steps: at the recognition system initialisation phase, calculating the union of the vocabulary subsets and determining the acoustic-phonetic units required for recognising the words contained in that union; re-compacting the neural network by eliminating all the weighted connections afferent to output computation units corresponding to unnecessary acoustic-phonetic units; and executing only the re-compacted network at each instant of time.

TECHNICAL FIELD

This invention relates to automatic signal recognition systems, and specifically concerns a method for accelerating neural network execution and a speech recognition device using that method.

BACKGROUND ART

An automatic speech recognition process can be schematically described as a number of modules placed in series between a speech signal as input and a sequence of recognised words as output:

-   a first signal processing module, which acquires the input speech signal, converting it from analogue to digital and suitably sampling it;
-   a second feature extraction module, which computes a set of parameters that describe well the features of the speech signal relevant to its recognition. This module uses, for example, spectral analysis (DFT) followed by grouping in Mel bands and a discrete cosine transform (Mel-based cepstral coefficients);
-   a third module that uses temporal alignment algorithms and acoustic pattern matching; for example, a Viterbi algorithm is used for temporal alignment, that is to say it manages the time distortion introduced by different rates of speech, while for pattern matching it is possible to use prototype distances, the likelihood of Markovian states or a posteriori probabilities generated by neural networks;
-   a fourth linguistic analysis module for extracting the best word sequence (present only for recognition of continuous speech); for example, it is possible to use models with bigrams or trigrams of words, or regular grammars.

In the above model, neural networks enter into the third module for the acoustic pattern matching aspect, and are used for estimating the probability that a portion of speech signal belongs to a phonetic class in a set given a priori, or constitutes a whole word in a set of predefined words.

Neural networks have an architecture that has certain similarities to the structure of the cerebral cortex, hence the name neural. A neural network is made up of many simple parallel computation units, called neurones, densely connected by a network of weighted connections, called synapses, that constitute a distributed computation model. Individual unit activity is simple: each unit sums the weighted inputs from its interconnections and transforms the sum with a non-linear function. The power of the model lies in the configuration of the connections, in particular their topology and intensity.

Starting from the input units, which are provided with data on the problem to solve, the computation propagates in parallel in the network up to the output units that provide the result. A neural network is not programmed to execute a given activity, but is trained using an automatic learning algorithm, by means of a series of examples of the reality to be modelled.

The MLP or Multi-Layer Perceptron model currently covers a good percentage of neural network applications to speech. The MLP model neurone sums its inputs, weighting them by the intensity of the connections, passes this value to a non-linear (logistic) function and delivers the output. The neurones are organised in levels: an input level, one or more internal levels and an output level. The connection between neurones of different levels is usually complete, whereas neurones of the same level are not interconnected.
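By way of illustration only, the neurone computation described above can be sketched as follows. The function names and the bias term are assumptions introduced for the example and are not taken from the cited documents.

```python
import math


def logistic(x: float) -> float:
    """Logistic (sigmoid) non-linearity."""
    return 1.0 / (1.0 + math.exp(-x))


def neurone_output(inputs, weights, bias=0.0):
    """Sum the inputs weighted by the connection intensities, then apply the logistic function.

    The bias term is an assumption of this sketch; the text does not mention one.
    """
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return logistic(total)


def level_output(inputs, weight_rows, biases):
    """Fully connected level: one row of weights per neurone of the level."""
    return [neurone_output(inputs, row, b) for row, b in zip(weight_rows, biases)]
```

Each call to `neurone_output` performs one product and one accumulation per incoming connection, which is the cost that the following paragraphs are concerned with.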

With specific regard to speech recognition neural networks, one recognition model in current use is illustrated in document EP 0 623 914. This document substantially describes a neural network incorporated in an automaton model of the patterns to be recognised. Each class is described in terms of left-right automatons with cycles on states, and the classes may be whole words, phonemes or other acoustic units. A Multi-Layer Perceptron neural network computes automaton state emission probability.

It is known, however, that neural network execution is very heavy in terms of the required computing power. In particular, a neural network used in a speech recognition system like the one described in the aforementioned document has efficiency problems in its sequential execution on a digital computer due to the high number of connections to compute (for each one there is a product of the input and the weight of the connection), which can be estimated as around 5 million products and accumulations for each second of speech.

An attempt at solving this problem, at least in part, was made in document EP 0 733 982, which illustrates a method for accelerating neural network execution, for processing correlated signals. This method is based on the principle that, since the input signal is sequential and evolves slowly over time in a continuous manner, it is not necessary to re-compute all the activation values of all the neurones for each input, but it suffices to propagate the differences with respect to the previous input in the network. In other words, the operation is not based on absolute values of neurone activation at time t, but on the difference with respect to the activation at time t−1. Therefore at each point of the network, if a neurone has, at time t, activation sufficiently similar to that of time t−1, it does not propagate any signal forward, limiting the activity exclusively to those neurones with an appreciable change in activation level.
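The principle of propagating differences can be sketched as follows. This is a simplified illustration of the idea stated above, not the actual procedure of EP 0 733 982; the function name, the `threshold` parameter and the bookkeeping of the last propagated value are assumptions of the sketch.

```python
def propagate_differences(net, last_sent, activations, weights, threshold):
    """Update, in place, the net inputs of the next level using only significant changes.

    net[i]        accumulated net input of destination neurone i
    last_sent[j]  activation of source neurone j that was last propagated
    weights[i][j] weight of the connection from source j to destination i
    Returns the number of products actually computed.
    """
    products = 0
    for j, activation in enumerate(activations):
        delta = activation - last_sent[j]
        if abs(delta) <= threshold:
            continue  # change too small: this neurone propagates nothing forward
        for i in range(len(net)):
            net[i] += weights[i][j] * delta
            products += 1
        last_sent[j] = activation
    return products
```

In this sketch the saving depends on how many source neurones change appreciably between consecutive inputs; every output unit is still updated, which is the residual cost addressed in the next paragraph.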

However, the problem remains, especially in the case of small vocabularies that use only a small number of phonetic units. Indeed, in known systems each execution of the neural network involves computing all the output units, with an evident computation load for the system.

SUMMARY OF THE INVENTION

This invention proposes to solve the problem of how to speed up neural network execution, in particular in cases in which activation of all the output units is not necessary, leaving the functional characteristics of the overall system in which the neural network is used unchanged.

This and other purposes are achieved by the method and device for accelerating neural network execution as claimed in the claims section.

The advantage of the invention is that it does not compute all the neural network unit output activations exhaustively, but in a targeted manner instead according to the real needs of the decoding algorithm.

In this way, not all the neural network output units need to be computed, which allows the network to be processed in a reduced mode and lowers the processing load on the overall system.

BRIEF DESCRIPTION OF DRAWINGS

These and other features of the invention are clarified in the following description of a preferred form of embodiment, given by way of example and by no means limiting, and in the annexed drawings in which:

FIG. 1 is a schematic representation of an MLP or Multi-Layer Perceptron type neural network;

FIG. 2 is a schematic illustration of a speech recognition device, incorporating a neural network produced according to this invention; and

FIG. 3 is a schematic illustration of an application of the method described in the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In reference to FIG. 1, an MLP or Multi-Layer Perceptron type neural network includes a generally very high number of computing units or neurones, organised in levels. These levels comprise an input level 8, an output level 14 and two intermediate levels, generally defined as hidden levels 10 and 12. Neurones of adjacent levels are connected together by weighted connections, for example the connection W_(ij) that unites neurone H_(j) of hidden level 12 with neurone N_(i) of output level 14, and each connection requires computing the product of the neurone input value and the weight associated with that connection. The great number of connections to compute, which can be estimated as around 5 million products and accumulations for each second of speech, highlights an efficiency problem in the sequential execution of the neural network on a digital computer.

For applications in the field of speech recognition, in terms of computing units or neurones, a neural network of the type illustrated in FIG. 1 may have the following structure:

-   Input level: 273 units (neurones)
-   First hidden level: 315 units
-   Second hidden level: 300 units
-   Output level: 686 units

(Total units: 1,574)

and the following values in terms of the number of weighted connections:

-   between input level and first hidden level: 7,665 connections
-   between first hidden level and second hidden level: 94,500 connections
-   between second hidden level and output level: 205,800 connections

(Total connections: 307,965)

It is evident that about ⅔ of the connections are located between the second hidden level and the output level. In effect, this is the connection level that has the greatest influence on the total computing power required by the overall system.
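The figures above can be checked with a few lines of arithmetic. The variable names are illustrative; the count for the first connection level is taken as quoted, since the first hidden level is block-structured rather than fully connected to the input level.

```python
units = {"input": 273, "hidden1": 315, "hidden2": 300, "output": 686}

connections = {
    "input->hidden1": 7_665,  # value quoted in the text (not a full 273 x 315 connection)
    "hidden1->hidden2": units["hidden1"] * units["hidden2"],  # complete connection
    "hidden2->output": units["hidden2"] * units["output"],    # complete connection
}

total_units = sum(units.values())
total_connections = sum(connections.values())
output_share = connections["hidden2->output"] / total_connections

print(total_units, total_connections, round(output_share, 3))
```

The share of connections feeding the output level comes out slightly above two thirds, which is why the method concentrates on that level.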

The proposed solution envisages not computing all the neural network output unit activations exhaustively, but in a targeted way according to the real needs of the decoding algorithm used, as will be explained in more detail later on.

In many cases, at least in the field of speech recognition, it suffices to compute only a subset of the output units instead of all of them, as is commonly done in traditional systems.

To better understand the basic principle of this method and the related device, let us now analyse a model of a recognition system produced according to this invention, with reference to FIG. 2.

The recognition model represented in FIG. 2 substantially comprises a neural network 4 incorporated in an automaton model 2 of the patterns to recognise. Each class is described in terms of left-right automatons 2a, 2b, 2c (with cycles on states), and the classes can be whole words, phonemes or acoustic units. A Multi-Layer Perceptron neural network 4 computes automaton state emission probability.

The input window, or input level 8, is 7 frames wide (each 10 ms long), and each frame contains 39 parameters (Energy, 12 Cepstral Coefficients and their primary and secondary derivatives). The total number of computing units or neurones in the input level is therefore equal to 7 × 39 = 273. The input window "runs" on a sequence of speech signal samples 6, sampled on input from the preceding modules (signal processing, etc.).
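A minimal sketch of the running input window follows, assuming each frame is already available as a list of 39 parameters. The function name is hypothetical, and the handling of the first and last frames of an utterance (for example by padding) is not described in the text and is left out.

```python
FRAME_PARAMS = 39   # energy + 12 cepstral coefficients, with first and second derivatives
WINDOW_FRAMES = 7   # frames presented together to the input level


def input_windows(frames):
    """Yield one flattened 7-frame window (273 values) per position of the running window."""
    for start in range(len(frames) - WINDOW_FRAMES + 1):
        window = frames[start:start + WINDOW_FRAMES]
        yield [value for frame in window for value in frame]
```

With 10 ms frames, the window advances by one frame per step, so the network is executed about 100 times per second of speech.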

The first hidden level 10 is divided into three feature extraction blocks, the first block 10b for the central frame, and the other two 10a and 10c for the left and right contexts. Each block is in turn divided into sub-blocks dedicated to considering the various types of parameter (E, Cep, ΔE, ΔCep).

The second hidden level 12 maps the space of the features extracted by the first hidden level 10 onto a set of self-organised features (for example silence, stationary sounds, transitions, specific phonemes).

The output level 14 integrates this information, estimating the probability of emission of the states of the words or acoustic-phonetic units used. The output is virtually divided into various parts, each corresponding to an automaton 2a, 2b, 2c, which in turn corresponds to an acoustic-phonetic class. The acoustic-phonetic classes used are so-called "stationary-transition" units, which consist of all the stationary parts of the phonemes plus all the admissible transitions between the phonemes themselves. The stationary units 2a, 2c are modelled with one state, and each corresponds to one network output unit, whereas the transition units are modelled with two states, for example unit 2b in FIG. 2, and correspond to two network output units.

The advantage of the invention is that the weighted connections 13, which connect the second hidden level 12 computing units with those of the output level 14, can be re-defined on each occasion, working through a specific selection module 16 that permits selection of those connections, thereby reducing their number. In practical terms the module 16 permits elimination of all connections afferent to output units not used at a given moment, thereby reducing the overall computing power necessary for network execution.

The selection module 16 is in turn controlled by a processing module 18 which receives from the recognition system vocabulary information (Voc. Inf) on the subsets of words that the system has to recognise at that given moment, and translates that information into specific commands for the selection module 16, according to the method described.

According to the invention, it is possible to apply the method when the words to be recognised are contained in a subset of the general vocabulary of words that the speech recognition system is capable of recognising. The smaller the subset of words used, the greater the advantages will be in terms of reducing the required computing power. In effect, when using speech recognition on small vocabularies, for example the ten numbers, the active vocabulary does not contain all the phonemes; thus not all the outputs of the network are necessary for decoding and recognising those words.

The method for accelerating execution of the neural network 4 envisages the execution of the following steps in the order given:

-   determining a subset of acoustic-phonetic units necessary for recognising all the words contained in the word subset of the general vocabulary;
-   eliminating from the neural network 4, by means of the selection module 16, all the weighted connections (W_(ij)) afferent to computing units (N_(i)) of output level 14 corresponding to acoustic-phonetic units not contained in the previously determined subset of acoustic-phonetic units, thereby obtaining a compacted neural network 4′, optimised for recognition of the words contained in that vocabulary subset;
-   exclusively executing, at each instant in time, the compacted neural network (4′).
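The elimination step can be pictured as keeping only the rows of the last weight matrix that belong to the needed output units. The following is a minimal sketch under that assumption; the function names, the row-per-output-unit layout and the bias terms are illustrative and are not prescribed by the description.

```python
import math


def compact_output_level(weights, biases, needed_units):
    """Keep only the weights afferent to the needed output units.

    weights[i][j] is the weight W_ij from hidden neurone H_j to output neurone N_i.
    Returns the kept unit indices together with the compacted weights and biases.
    """
    kept = sorted(needed_units)
    return kept, [weights[i] for i in kept], [biases[i] for i in kept]


def run_output_level(hidden, weights, biases):
    """Execute an output level (full or compacted) on the hidden activations."""
    return [
        1.0 / (1.0 + math.exp(-(b + sum(w * h for w, h in zip(row, hidden)))))
        for row, b in zip(weights, biases)
    ]
```

Because each retained unit keeps its original weights, its activation is the same as in the full network; only the units that are never consulted by the decoding are skipped.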

The unnecessary connection elimination phase is schematically represented in FIG. 3, which shows a highly simplified example of a neural network with just four neurones H₁ ... H₄ in hidden level 12 and eight neurones N₁ ... N₈ in the output level 14. In a traditional type neural network each neurone in hidden level 12 is connected to all the neurones in the output level 14, and all the corresponding connections (W_(ij)) are computed at each execution of the neural network.

Supposing in this case that the vocabulary subset only requires the presence of the acoustic-phonetic units corresponding to output neurones N₃ and N₅, highlighted in FIG. 3, the method described in the invention permits temporary elimination of all the weighted connections afferent to unused acoustic-phonetic units, and the reduction of neural network execution to a limited number of connections (connections W₃₁, W₃₂, W₅₁, W₅₂, W₃₃, W₅₃, W₃₄, W₅₄ highlighted in FIG. 3).
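The simplified example of FIG. 3 can be reproduced by enumerating the connections, using the convention that W_ij joins hidden neurone H_j to output neurone N_i. The variable names are illustrative only.

```python
hidden_neurones = [1, 2, 3, 4]          # H1 ... H4
output_neurones = list(range(1, 9))     # N1 ... N8
needed_outputs = {3, 5}                 # N3 and N5 are the only units required

all_connections = [(i, j) for i in output_neurones for j in hidden_neurones]
kept_connections = [(i, j) for (i, j) in all_connections if i in needed_outputs]
kept_labels = [f"W{i}{j}" for (i, j) in kept_connections]

print(len(all_connections), len(kept_connections), kept_labels)
```

In this toy case 8 of the 32 connections remain, namely the eight connections listed in the paragraph above.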

With regard to the choice of the acoustic-phonetic unit subset, it generally suffices to include only the stationary units and the transition units of the phonemes that make up the words in the subset of the general vocabulary.

However, since there are far fewer stationary units than transition units (in the Italian language, for example, 27 as against 391), it is preferable to leave all the stationary units in any case and eliminate only the transition units not present in the active vocabulary.

Moreover, since stationary units 2a and 2c are modelled with a single state, and correspond to a single network output unit, whereas the transition units are modelled with two states (and two network output units), the reduction in terms of network output units is in all cases very high.
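The preferred selection rule (all stationary units, plus only the transition units occurring in the active words) can be sketched as follows. The lexicon format, the toy transcriptions and the derivation of transition units from adjacent phoneme pairs are assumptions made for the example; a real system would also have to account for transitions to and from silence, which are ignored here.

```python
def select_units(all_stationary, lexicon, active_words):
    """Return the set of acoustic-phonetic units to keep.

    all_stationary  iterable of phoneme symbols (every stationary unit is kept)
    lexicon         hypothetical mapping word -> list of phoneme symbols
    active_words    words of the active vocabulary subset
    """
    units = {("S", phoneme) for phoneme in all_stationary}
    for word in active_words:
        phonemes = lexicon[word]
        for left, right in zip(phonemes, phonemes[1:]):
            units.add(("T", left, right))  # transition between adjacent phonemes
    return units


# Toy transcriptions, for illustration only.
toy_lexicon = {"si": ["s", "i"], "no": ["n", "o"], "uno": ["u", "n", "o"]}
kept_units = select_units(["s", "i", "n", "o", "u"], toy_lexicon, ["si", "no"])
```

Here the five stationary units are all retained, while only the transitions s-i and n-o survive; the transition u-n of the inactive word "uno" is dropped.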

The general vocabulary subset used at a given moment by the speech recognition system can be the union of several subsets of active vocabularies. For example, in the case of an automatic speech recognition system for telephony applications, a first word subset could consist simply of the words "yes" and "no", whereas a second subset could consist of the ten numbers from zero to nine, while the subset used at a given moment could be the union of both subsets. The choice of a vocabulary subset obviously depends on the set of words that the system should be able to recognise at each moment.

Referring to the previously described neural network structure, with 1,574 computing units or neurones and 307,965 weighted connections, and considering a subset containing the ten numbers (0 to 9), applying the method described in the invention would reduce the weighted output connections to 26,100, and the total weighted connections to 128,265 (42% of the initial number). This would in theory save 58% of the required computing power.

In the case of the ten numbers, the selected acoustic-phonetic units would be 27 stationary plus 36 transition (of which 24 with 2 states), that is to say, just 87 states out of the 683 total.
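The figures for the ten-number example can be reproduced as follows. The sketch assumes that the 12 transition units not counted among the 24 two-state ones contribute a single state each, which is what the stated total of 87 states implies; the variable names are illustrative.

```python
stationary_units = 27
transition_units = 36
two_state_transitions = 24

# Assumption: the remaining 36 - 24 = 12 transition units have one state each.
states = (stationary_units
          + (transition_units - two_state_transitions)
          + 2 * two_state_transitions)

hidden2_units = 300
output_connections = states * hidden2_units          # connections kept in the last level
other_connections = 7_665 + 94_500                   # unchanged lower levels
reduced_total = other_connections + output_connections
full_total = other_connections + hidden2_units * 686

remaining_percent = round(100 * reduced_total / full_total)
print(states, output_connections, reduced_total, remaining_percent)
```

The result agrees with the 26,100 output connections and 128,265 total connections quoted above, that is about 42% of the original network.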

Although this example refers to the Italian language, the method does not depend on the language and is entirely general.

In terms of the algorithm for implementing the method in a recognition system, the method is structured into the following steps:

-   1) at the recognition initialisation phase, computing the union of active vocabularies required for recognition;
-   2) re-compacting the last level of the neural network, always leaving all the stationary units and only the transition units contained in the active vocabulary;
-   3) executing only the re-compacted network at each instant in time.
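The three steps above might be organised as an initialisation routine followed by a per-frame routine, as in the sketch below. All names and data structures (`transitions_of_word`, `outputs_of_unit`, the dictionary of net inputs) are hypothetical, and the non-linearity of the output units is omitted for brevity.

```python
def initialise(active_vocabularies, transitions_of_word, stationary_units, outputs_of_unit):
    """Steps 1 and 2: return the sorted indices of the output units to keep."""
    words = set().union(*active_vocabularies)     # step 1: union of the active vocabularies
    needed = set(stationary_units)                # all stationary units are always kept
    for word in words:
        needed |= transitions_of_word[word]       # transition units of active words only
    return sorted(i for unit in needed for i in outputs_of_unit[unit])


def execute(hidden, weights, kept_rows):
    """Step 3: compute the net input of the retained output units only."""
    return {i: sum(w * h for w, h in zip(weights[i], hidden)) for i in kept_rows}
```

`initialise` would be called once when the active vocabularies are set, whereas `execute` runs at every frame, so the cost of selecting the units is paid only at initialisation.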

The experimental results obtained by implementing the illustrated method in a speech recognition system, with the task of recognising the ten numbers, gave an average reduction of 41% in required computing power. Conversely, as expected, there was no appreciable difference on large vocabularies (2000 words).

CLAIMS

1. Method for accelerating neural network (4) execution in a speech recognition system, for recognising words contained in a subset of a general vocabulary of words that the same system is capable of recognising, said neural network (4) comprising a number of computing units organised in levels, among which at least one hidden level (12) and one output level (14), the computing units (H_(j)) of said hidden level (12) being connected to the computing units (N_(i)) of said output level (14) via weighted connections (W_(ij)), said computing units (N_(i)) of said output level (14) corresponding to acoustic-phonetic units (2) of said general vocabulary, characterised in that it comprises the following steps: determining a subset of acoustic-phonetic units necessary for recognising all the words contained in said general vocabulary subset; eliminating from the neural network (4) all the weighted connections (W_(ij)) afferent to computing units (N_(i)) of said output level (14) that correspond to acoustic-phonetic units not contained in said previously determined subset of acoustic-phonetic units, thus obtaining a compacted neural network (4′) optimised for recognition of the words contained in said general vocabulary subset; executing, at each moment in time, only said compacted neural network (4′).
2. Method according to claim 1, in which said acoustic-phonetic units (2) comprise stationary units (2a, 2c) and transition units (2b), and said step of determining a subset of acoustic-phonetic units consists of determining the stationary units (2a, 2c) and transition units (2b) present in said general vocabulary subset.
3. Method according to claim 1, in which said acoustic-phonetic units (2) comprise stationary units (2a, 2c) and transition units (2b), and said step of determining a subset of acoustic-phonetic units consists of selecting all the stationary units (2a, 2c) and determining the transition units (2b) present in said general vocabulary subset.
4. Method according to claim 2 or 3, in which said general vocabulary subset is the union of several subsets of vocabularies active at any given moment.
5. Speech recognition system, comprising a neural network (4) with a number of computing units organised in levels, including at least one hidden level (12) and one output level (14), the computing units (H_(j)) of said hidden level (12) being connected to the computing units (N_(i)) of said output level (14) via weighted connections (W_(ij)), said computing units (N_(i)) of said output level (14) corresponding to acoustic-phonetic units (2) of a general vocabulary of words to be recognised, characterised in that it comprises means (18, 16) for accelerating neural network (4) execution, for recognising words contained in a subset of said general vocabulary, said means (18, 16) comprising: a first module (18) for determining the subset of acoustic-phonetic units necessary for recognising all the words contained in said general vocabulary subset; a second module (16) for selecting, from among the weighted connections (W_(ij)) connecting the computing units (H_(j)) of hidden level (12) with those of output level (14), the weighted connections afferent to computing units (N_(i)) corresponding to acoustic-phonetic units contained in said subset of acoustic-phonetic units determined by said first module (18), thus obtaining a compacted neural network (4′) optimised for recognition of the words contained in said general vocabulary subset.
6. System according to claim 5, in which the acoustic-phonetic units (2) comprise stationary units (2a, 2c) and transition units (2b), and said first module (18) comprises the means for determining both the stationary units (2a, 2c) and transition units (2b) present in said general vocabulary subset.
7. System according to claim 5, in which said acoustic-phonetic units (2) comprise stationary units (2a, 2c) and transition units (2b), and said first module (18) comprises means for selecting all the stationary units (2a, 2c) and determining the transition units (2b) present in said general vocabulary subset.
8. System according to claim 6 or 7, in which said general vocabulary subset is the union of several subsets of vocabularies active at any given moment.